Guillermo Moncecchi
This notebook contains some code notes on Principal Component Analysis, based on a very simple example. If you want the original, much better explanation, refer to this page by Sebastian Raschka.
In [3]:
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
np.set_printoptions(suppress=True,precision=3)
from matplotlib.patches import FancyArrowPatch
Let us define a very simple dataset with 6 people, identified by their height, weight, and the length of their middle finger (?). Each column is an instance with 3 attributes (we use this format so we can directly call the np.cov() function).
In [4]:
X_men=np.array([[1.97,110,5],[1.80,70,4.8],[1.70,90,4.9]]).transpose()
X_women=np.array([[1.65,52,4.7],[1.75,65,4.8],[1.67,58,4.6]]).transpose()
X = np.hstack((X_men,X_women))
print (X)
In [5]:
mean=np.mean(X,axis=1)
std=np.std(X,axis=1)
var=np.var(X,axis=1)
print ("Means:",mean)
print ("Standard deviations:",std)
print ("Variances:",var)
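A quick aside on what `axis=1` means here: since each row holds one attribute and each column one instance, statistics are taken across columns, yielding one value per attribute. A tiny sketch with made-up numbers:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])

row_means = np.mean(A, axis=1)   # one mean per attribute (row)
col_means = np.mean(A, axis=0)   # one mean per instance (column)

print(row_means)  # [ 2. 20.]
print(col_means)  # [ 5.5 11.  16.5]
```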
Rescale each attribute linearly to the [0, 1] range (min-max scaling)
In [35]:
def scale_linear_byrow(X):
    # Min-max scaling per attribute (row): map each row to [0, 1]
    mins = np.min(X, axis=1).reshape(3,1)
    maxs = np.max(X, axis=1).reshape(3,1)
    rng = maxs - mins
    return (X - mins) / rng
X_r=scale_linear_byrow(X)
print(X_r)
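Min-max scaling is one option; another common choice before PCA is z-score standardization, which gives each attribute zero mean and unit variance. A minimal sketch (the function name and demo data are made up for illustration):

```python
import numpy as np

def scale_standard_byrow(X):
    # Z-score standardization per attribute (row): zero mean, unit variance
    means = np.mean(X, axis=1, keepdims=True)
    stds = np.std(X, axis=1, keepdims=True)
    return (X - means) / stds

X_demo = np.array([[1.0, 2.0, 3.0],
                   [10.0, 20.0, 60.0]])
X_s = scale_standard_byrow(X_demo)
print(np.mean(X_s, axis=1))  # ~[0. 0.]
print(np.std(X_s, axis=1))   # ~[1. 1.]
```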
In [36]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
plt.rcParams['legend.fontsize'] = 10
ax.plot(X_r[0,0:3], X_r[1,0:3], X_r[2,0:3], 'o', markersize=8, color='red', alpha=0.5, label='men')
ax.plot(X_r[0,3:6], X_r[1,3:6], X_r[2,3:6], 'o', markersize=8, color='blue', alpha=0.5, label='women')
ax.legend()
plt.show()
Calculate the mean, standard deviation and variance for every attribute of the rescaled data, using the axis argument of each method
In [37]:
mean=np.mean(X_r,axis=1)
std=np.std(X_r,axis=1)
var=np.var(X_r,axis=1)
print ("Means:",mean)
print ("Standard deviations:",std)
print ("Variances:",var)
Now, obtain the covariance matrix for our sample
In [38]:
cvm=np.cov(X_r)
print (cvm)
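To see what `np.cov()` actually computes, here is a hedged sketch that reproduces it by hand on made-up data: each row is a variable, each column an observation; the data is centered per row and the outer product is normalized by (n - 1):

```python
import numpy as np

X_demo = np.array([[1.0, 2.0, 4.0],
                   [2.0, 3.0, 7.0]])

# Center each variable (row), then normalize the outer product by (n - 1)
n = X_demo.shape[1]
X_c = X_demo - np.mean(X_demo, axis=1, keepdims=True)
manual = X_c.dot(X_c.T) / (n - 1)

print(manual)
print(np.cov(X_demo))  # same matrix
```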
Calculate the eigenvalues and eigenvectors of the covariance matrix. Since it is symmetric, we know its eigenvalues are real (and its eigenvectors can be chosen orthonormal)
In [39]:
eig_val_cov, eig_vec_cov = np.linalg.eig(cvm)
for i in range(len(eig_val_cov)):
    eigvec_cov = eig_vec_cov[:,i].reshape(1,3).T
    print('Eigenvalue {} from covariance matrix: {}'.format(i+1, eig_val_cov[i]))
    print('Eigenvector:')
    print(eigvec_cov)
Verify that the eigenvalue equation holds:
In [40]:
for i in range(len(eig_val_cov)):
    eigv = eig_vec_cov[:,i].reshape(1,3).T
    np.testing.assert_array_almost_equal(cvm.dot(eigv), eig_val_cov[i] * eigv,
                                         decimal=6, err_msg='', verbose=True)
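Since the covariance matrix is symmetric, `np.linalg.eigh()` is an alternative worth knowing: it exploits symmetry, guarantees real output, and returns the eigenvalues sorted in ascending order. A small sketch on a made-up symmetric matrix:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is specialized for symmetric/Hermitian matrices:
# real eigenvalues, returned in ascending order
vals, vecs = np.linalg.eigh(S)
print(vals)  # [1. 3.]

# Each pair still satisfies the eigenvalue equation S v = lambda v
for i in range(len(vals)):
    assert np.allclose(S.dot(vecs[:, i]), vals[i] * vecs[:, i])
```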
In [41]:
# Sort eigenvectors by decreasing eigenvalue
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_val_cov[i]), eig_vec_cov[:,i]) for i in range(len(eig_val_cov))]
# Sort from high to low on the eigenvalue only; sorting the bare tuples
# would try to compare the numpy arrays whenever two eigenvalues tie
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
for pair in eig_pairs:
    print(pair)
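The sorted eigenvalues also tell us how much variance each component captures: each eigenvalue divided by the sum of all eigenvalues gives the explained variance ratio. A sketch with made-up eigenvalues:

```python
import numpy as np

# Made-up eigenvalues, already sorted from high to low
eig_vals = np.array([4.0, 1.0, 0.5])

# Fraction of total variance captured by each component
ratios = eig_vals / eig_vals.sum()
print(ratios)             # [0.727 0.182 0.091] (approx.)
print(np.cumsum(ratios))  # cumulative variance explained
```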
In [42]:
# Keep just the two principal components with the largest eigenvalues
matrix_w = np.hstack((eig_pairs[0][1].reshape(3,1), eig_pairs[1][1].reshape(3,1)))
print('Matrix W:\n', matrix_w)
Transform instances to the new subspace...
In [43]:
# Transform the instances to the new subspace
transformed = np.dot(X_r.T,matrix_w).transpose()
print (transformed)
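One detail worth noting: the projection above multiplies the raw X_r by W, while the covariance matrix was computed on centered data (np.cov centers internally). The usual PCA convention is to center before projecting, which places the projections around the origin. A hedged sketch with made-up data:

```python
import numpy as np

X_demo = np.array([[1.0, 2.0, 3.0],
                   [2.0, 4.0, 6.0]])

# Center each attribute (row) before projecting, the usual PCA convention
X_c = X_demo - np.mean(X_demo, axis=1, keepdims=True)

vals, vecs = np.linalg.eigh(np.cov(X_demo))
W = vecs[:, [-1]]           # top eigenvector (eigh sorts ascending)
projected = W.T.dot(X_c)

# Centered projections have zero mean along each new axis
print(np.mean(projected, axis=1))  # ~[0.]
```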
In [48]:
plt.plot(transformed[0,0:3], transformed[1,0:3], 'o', markersize=7, color='blue', alpha=0.5, label='men')
plt.plot(transformed[0,3:6], transformed[1,3:6], '^', markersize=7, color='red', alpha=0.5, label='women')
#plt.xlim([-4,4])
#plt.ylim([-4,4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')
plt.show()
In [45]:
# Keep just the first principal component
matrix_w2 = eig_pairs[0][1].reshape(3,1)
print('Matrix W:\n', matrix_w2)
In [46]:
transformed2 = np.dot(X_r.T,matrix_w2).transpose()
print (transformed2)
In [47]:
plt.plot(transformed2[0,0:3],[0,0,0], 'o', markersize=7, color='blue', alpha=0.5, label='men')
plt.plot(transformed2[0,3:6],[0,0,0], '^', markersize=7, color='red', alpha=0.5, label='women')
#plt.xlim([-4,4])
#plt.ylim([-4,4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')
plt.show()